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Abstract 

We examined whether or not the predictive ability of genomic best linear unbiased prediction (GBLUP) could be improved 
via a resampling method used in machine learning: bootstrap aggregating sampling ("bagging"). In theory, bagging can be 
useful when the predictor has large variance or when the number of markers is much larger than sample size, preventing 
effective regularization. After presenting a brief review of GBLUP, bagging was adapted to the context of GBLUP, both at the 
level of the genetic signal and of marker effects. The performance of bagging was evaluated with four simulated case 
studies including known or unknown quantitative trait loci, and an application was made to real data on grain yield in 
wheat planted in four environments. A metric aimed to quantify candidate-specific cross-validation uncertainty was 
proposed and assessed; as expected, model derived theoretical reliabilities bore no relationship with cross-validation 
accuracy. It was found that bagging can ameliorate predictive performance of GBLUP and make it more robust against over- 
fitting. Seemingly, 25-50 bootstrap samples was enough to attain reasonable predictions as well as stable measures of 
individual predictive mean squared errors. 
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Introduction 

A method for whole-genome enabled prediction of quantitative 
traits known as GBLUP, standing for "genomic best linear 
unbiased prediction", was seemingly suggested first by Van Raden 
[1-2]. In GBLUP, a pedigree-based relationship matrix among 
individuals [3] is replaced by a matrix valued measure of genomic 
similarities constructed using molecular markers, such as single 
nucleotide polymorphisms (SNPs) [4]. This "genomic relation- 
ship" matrix or variants thereof [5-6], referred to as G hereinafter, 
defines a covariance structure among individuals (even if 
genetically unrelated in the standard sense), stemming from 
"molecular similarity" in state at additive marker loci among 
members of a sample. Given G and values of some needed 
variance components, one can use the theory of best linear 
unbiased prediction (BLUP) to obtain point and interval (e.g., 
prediction error variances) estimates of the genetic values of a set 
of individuals as marked by the battery of SNPs. While G can be 
constructed in different manners we do not address this issue here. 
However, we note that it is possible to separate "genomic" from 
"residual" variance components statistically even in the absence of 
genetic relatedness. Hence, care must be exercised when relating 
the "genomic" to the "genetic" variance; this is discussed in [7]. 

Using theory developed by Henderson [8-10] it can be shown 
that GBLUP is equivalent to a linear regression model on additive 
genotypic codes of markers, with the allelic substitution effects at 



marker loci treated as drawn independently from a distribution 
possessing a constant variance over markers [11-13]. There is also 
an equivalence between pedigree-based BLUP or G-BLUP (or of 
models using both pedigree and marker relationships) and non- 
parametric regression [14]. For instance, if the nxp marker 
matrix X (w = number of observations, p = number of markers) is 
used to construct a kernel matrix XX , it can be established that 
GBLUP is a reproducing kernel HUbert spaces regression 
procedure [15-16]. Also, BLUP and GBLUP can be represented 
as linear neural networks where inputs are entries of the pedigree- 
based or G matrices, respectively [17]. Hence, GBLUP can be 
motivated from several different perspectives. 

There are many competing procedures for genome-enabled 
prediction, such as the members of the growing Bayesian alphabet 
[14], [18- 19], but most of these require a Bayesian Markov chain 
Monte Carlo (MCMC) implementation. On the other hand, 
GBLUP is simple, easy to understand, explain and compute, and 
there is software available for likelihood-based variance compo- 
nent estimation and for prediction of random effects. Also, 
GBLUP handles cross-sectional and longitudinal data flexibly and 
extends to multiple-trait settings in a direct manner. Further, 
GBLUP delivers a competitive predictive ability since members of 
the Bayesian alphabet typically differ by little in predictive 
performance and differences among methods are typically masked 
by cross-validation noise [19-23]. Last but not least, some 
members of this alphabet produce predictions that are sensitive 



PLOS ONE I www.plosone.org 



1 



April 2014 I Volume 9 | Issue 4 | e91693 



Bagging Genomic BLUP 



to h\pt*r-parameter specification [24]. Given these considerations, 
GBLUP or extensions thereof [25] are good candidates for routine 
whole-genome prediction in animal and plant breeding applica- 
tions and possibly for prediction of complex traits in humans as 
well [7], [26]. 

The closeness between predictions and realized values depends 
mainly on three factors: prediction bias, variance of prediction 
error and amount of noise associated with future observations. 
The latter cannot be reduced by any prediction machine based on 
training data, so it is impossible to construct predictors attaining a 
perfect predictive correlation, even if the model holds. Theoretical 
and empirical results [1], [27-30] indicate that the proportion of 
(cross-validation) variance explained by a linear predictor increases 
up to a point with training sample size, then reaching a plateau. 
However, when a small number of individuals is available, any 
prediction machine is bound to produce predictions with a large 
variance. In this context, it seems worthwhile to explore avenues 
for enhancing accuracy (i.e., reduce bias) and reliability (i.e., 
decrease variance) of predictions when training size is small. 

The question examined here is whether or not the predictive 
ability of GBLUP can be improved by recourse to resampling 
methods used in machine learning. These include bootstrap 
aggregating sampling ("bagging") and iterated bagging or 
"debiasing" [31-34]. Bagging uses bootstrap sampling to reduce 
variance and can enhance reliability and reduce mean squared 
error [31]. The conditional bias of GBLUP [19] cannot be 
removed by bagging, but iterated bagging has the potential of 
reducing variance while removing bias simultaneously. This study 
investigates bagging of GBLUP, with consideration of debiasing 
deferred to a future investigation. The second section of this paper 
gives a review of GBLUP and of its inherent inaccuracy (bias). The 
third section describes bagging in the context of GBLUP at the 
level of the genetic signal and of marker effects. The fourth section 
examines the performance of bagging in four simulated case 
studies, and the fifth section presents an application to real data on 
grain yield in wheat planted in four environments. The paper 
concludes with a discussion and with a proposal of a metric aimed 
to quantify candidate-specific cross-validation uncertainty. 

Materials and Methods 

GBLUP 

Idealized conditions. Assume that effects of nuisance factors 
(e.g., year to year variation) have been removed in a pre- 
processing stage (this can also be done in a single-stage, but we 
ignore this for simplicity). GBLUP can be motivated by positing 
the linear regression model on markers 

y=X/?+e, (1) 

where y is an n x 1 vector of observations or pre-processed data 
measured on a set of individuals or lines; X is an « x /) matrix of 
marker genotypes, with its topical element Xij being the genotype 
code at locus j observed in individual /, and with 
rank{X) <rmn(n,p); fi is a px\ vector of unknown allehc 
substitution effects when marker genotypes are coded, e.g., as 
— 1,0 and 1 for aa, Aa and A A at locus A, say, or when these 
coded values are deviated from the corresponding column means 
or standardized. Above, e~A^(0,Nf;^) is a vector of residuals 
where is the residual variance and N is an « x « diagonal matrix 
with typical element Nu; if y consists of single measurements on 
individuals, Na = 1 for all i= 1,2, ...,n, and if y is a vector of means, 
Nu would be the number of observations entering into the mean. 
In BLUP, a distribution is assigned to fi and the simplest one is 



N (^0,l(jjjj , where ct^ is the variance of marker allele 

substitution effects. Using this assumption together with model (1) 
gives as marginal distribution of the data (after assuming that X 

and y have been centered) y|X-iv(^0,XX'o^ +No^j . In BLUP 

ajj and are treated as known but these parameters are typically 
estimated from data at hand [35] . With markers, most ()ft(;n n<p 
so it is convenient to form the best linear unbiased predictor of)? as 
fi = (jjx'\-^y, where V = XX'f7^ +N(72. If model (1) holds, it can 

be shown that fi is unbiased in the sense that E(ji\X^ =E{fi)=0. 
Its covariance matrix (given X) is 

V^ = a;^\-'\, (2) 

and its prediction error covariance matrix (also covariance matrix 
of the conditional distribution of fi given y and X under normality 
assumptions) is 

V/;|y = V^-V^ = a2(l-<x2x'v->x), (3) 
where VjS = IfT^. The diagonal elements of V^|y (v^y) lead to a 

Vii 

measure of reliability of prediction of marker effect j: relj = 1 ^. 

A matrix of reliabilities and co-reliabilities is 

R = I-V/»|yV^-'=V^V^-'. (4) 

If one wishes to predict a future vector y^ = Xyjff-|-e/, with future 

residuals independent of past ones and provided that future and 
past residuals stem from the same stochastic process, under 
normality assumptions the predictive distribution [19] is 

y^|y,X,X^~iv(x^^,X^V^I,x;+V(7^j 

Further, if y^ is the predictand and BLUP(Xffi) = Xffi = yj- is the 
predictor, BLUP theory yields 

Var(yj-\y) = F«r(y^) - Far(y^) , (6) 

so that Var(yj'j =X/V^X|. If the marker effects could be 
estimated such that V^|y^O, then Kflc^yyJ — >X| V^Xy and 

Far^yylyj ^lyff^. Hence, the distribution in (5) would stiU have 

variance, indicating that it is not possible to attain a predictive 
correlation equal to 1 even if a training set has an infinite size 
(more formally, if the training process conveys an infinite amount 
of information about markers); [7] gives a discussion of related 
matters. 

Blup is conditionally inaccurate 

While BLUP theory is well established, quantitative geneticists 
tend to interpret the unbiasedness property of BLUP as if it 
pertained to the true unknown fi, when in fact it applies to the 
average of the distribution of fi, that is, 0. If jff in (1) were viewed as 
a model on unknown effects of known quantitative trait loci 
(QTL), it is obvious that one should think in terms of a fixed effects 
model, as per the standard finite number of loci model of 
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quantitative genetics [36] . Accordingly, if effects of markers are 
sought because these "flag" some genomic region of interest, the 
random sampling assumption made in BLUP is not relevant, 
although it might lead to a more stable estimator. In the fixed 
effects case both fi and the marked genetic signal Xy? are estimated 
with bias by BLUP even if the model holds [19]. 

Markers are not QTL and the latter are "causes" of generating 
a signal to phenotype. Suppose that the "true model" is linear on 
effects (q) of QTL relating to phenotypes via incidence matrix Q, 
that is 



the following squared correlation between the ith elements of jy 
and Qy-q (below Qy is the ith row of Qy) 



Var 



(Q/,,-q) _Q;,Kflr(q)Qy, 



Let now n—>co in which case BLUP theory yields 
Q}_,.Far(q)Q/_,-^Q^_,.Q/_,f72, so tiiat 



y = Qq + e. 



(7) 



If the QTL effects q are viewed as fixed entities (arguably 
geneticists have this in mind in their quest of finding genes), 
^(y|Q*l)=QQ- In this situation BLUP produces the following 
average outcomes 

£(^|q,Q,x) =E[<jjx'y-'y\q] =ajli!\-'Qq, 

and 

£(x^|q,Q,x) =i<[ff2xx'v-iy|q] =4xx'v-'Qq, 

so the bias of the estimated signal is 
Qq-<7^XX'v->Qq= [l-(7^XX'v-']Qq. 

On the other hand, if q is assigned a distribution, say, N^O,lrj^^ 
and the p markers are treated as random as well, e.g., 
;ff|(T|~iV^0,I(7|^, under normality assumptions one has 



E [q\fi,Q,X j = E{q) + Cov(q,fi '/? = v^^ers (8) 



This shows that it is impossible to attain a perfect predictive 
correlation even when tlu^ markers are the QTL. Further, 

Qf iQ/'; = X] 2f„) where Qf ij is the genotype at QTL locus j 

(j= 1,2. of individual i in the testing set. If QTL genotypes 
are centered and assumed to be in Hardy-Weinberg equilibrium 

^(^^ijj =^Pj{^~Pj}> so approximately 



1 + 



"1 



t,2pj{l-pj)a; 



(9) 



nq 



where pj is the frequency of a reference allele, ^ 2j7, (l ~Pj)o'q ^s 
additive genetic variance and 



where 



E2pj{l-pj)al 

ng 

T.2pj{\-pj)al + al 



Cov(q,^) = Cov(q,ff2y'v-iX) 



= <T2cov(q,q')Q'v-iX = a^a2Q'v-iX. 



Using this in (8) 



where QV^'X conveys unknown linkage disequilibrium rela- 
tionships between QTL and marker genotypes; similarly, the best 

statement about the signal is £'^Qq|;ff,Q,X^ =Q4narkers- Unfor- 
tunately, neither nor Q are known, so statements made about 
QTL from markers are based on untestable assumptions, 
including the view that the QTL effects and the genotypes are 
linearly related, as in (7). 

Predictive correlation when markers are the QTL 

Imagine a best case scenario where the markers are the QTL, 
and consider predicting jy = Qyq-|-e/'. Here, Qy^ is the incidence 
matrix relating QTL effects to yet to be realized phenotypes jy, 
and e/ is a future vector of residuals. BLUP theory, using Qy^q 
(here, q is the BLUP{q) under the true model) as predictor, gives 



is trait heritabUity. Hence, the predictive correlation has an upper 
bound at h. 

If instead of predicting individual phenotypes the problem is 
that of predicting an average, the upper bound for the predictive 
correlation is higher. The corresponding formula is easy to arrive 
at and it just requires replacing in (9) by ol/nAve, tiie number of 
observations intervening in the average. Then 



1 + 



nq 



Y.2pj{\-pj)a- 

7 = 1 



which is heritability as used in plant breeding, or "heritabUity of a 
mean" [36]. The predictive correlation never reaches 1 but can 
get close to 1 if riAve is very large. Values of squared predictive 
correlations in the range of 0.5-0.75 have been attained in dairy 
cattie progeny tests where predictands are processed averages of 
production records of cows sired by a bull [2] . 

Bagging GBLUP (BGBLUP) and marl<er effects 

Bagging GBLUP. "Bagging" exploits the idea that predictors 
can be rendered more stable by repeated bootstrapping and 
averaging over bootstrap samples, thus reducing variance [31]. 
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Bagging has been found to have advantages in cases wliere 
predictors are unstable, i.e., when small perturbations of the 
training set produce marked changes in model training [31], [34], 
[37]; for example, with ordinary least-squares under severe multi- 
colinearity. An important application of bagging is in prediction 
using random forests [38] . 

Prediction methods that use regularization, such as those 
applied in genome-enabled selection, are often stable because 
penalties on model complexity reduce the effective number of 
parameters drastically, thus lowering variance. However, this is 
attained at the expense of bias with respect to marker effects and to 
the unknown function to be predicted (marked genetic value). A 
priori it would seem that bagging would not bring advantages in 
the context of a regularized method such as GBLUP. However, 
this issue has not been examined so far and there may be cases, 
e.g., in "small" populations, where random variation in training 
sets of small sizes has a marked impact on predictive ability. 

To motivate bagging, we recall that GBLUP is a regression of 
phenotypes on genomic relationships between individuals. Let 
g = Xy? be a vector of "genomic values", GocXX and assume that 
G^' exists; then (1) can be written in equivalent form as 

y = GG-ig + e = Gg*+e (10) 

where g~N(^0,Gf7^j) and =G'^g~'N(o,G-^ aj^ . In scalar 

form, the datum for individual / is expressible under this 
parameterization as the regression 

ng 

J-^ ■ (11) 

= G;-g*+e, , 

where Gy is an element of the nxn matrix G and G, is its i row. 
From basic principles, 

i?L[/P(g*) = Cov(g*,y')V-V = ^^V-'y = r, 
and since Gg* = g, use of invariance gives 



A given ^j',,G,-j may not appear at all or may be repeated several 
times over the B bootstrap samples. The bagging algorithm is: 

• For each copy h {h= 1,2, ...,B) run GBLUP using estimates of 
variance components obtained from the entire data set (to 
simplify computations), find the regressions g^ and form a 
bootstrap draw for BLUP(g) as 

gb = Gbgl;b=l,2,...,B. (13) 

• After running the B GBLUP implementations take the 
following averages 

B 

Eg;, 

_ h=l 
gfiagged 1 

and 

B 

^bagged = '^^^ ■ (14) 

• Predict vector y^j.^, in the testing set as ^x^s, = GjesLTraingBagged 
where Gjest.Train is a matrix of genomic relationships between 
individuals in testing and training sets. If y-p^jt pertains to 
records on the same individuals the predictor is ^xest = ^bagged • 

To see how improvement arises consider the following 
argument. We seek to learn signal g from the GBLUP predictor 
g. Under the setting of BLUP (where g and y both vary at random 
over conceptual repeated sampling), the best predictor of g in the 
mean squared error sense is g because this is the conditional 
expectation under normality [39-40] . However, if g is the signal of 
a fixed target set of candidates, g is biased as shown earlier. Thus, 
£(g|g) = g + ^, where (5 is a vector of biases, and the mean squared 
error matrix of ^j,3ggj.j is 



BL UP{g) = GBL UP(g* ) = GV - 1 y = g 



(12) 



Note that c^jGV ' is an n x n matrix of "heritabHities" and "co- 
heritabilities"; the diagonal elements of G may be distinct from 
each other, so each individual may have a different heritabUity 
ascribed to it, as noted by de los Campos et al. (2013). 

The intuitive idea behind bagging was outlined in [1]. Suppose 
there is a large number of training samples from the same 
population; by averaging over the predictions made from these 
samples we would end up with a reduction of variance but with the 
bias properties remaining the same as those from the predictor 
derived from a single training set. This large supply of training sets 
can be emulated by bootstrap sampKng: by averaging over 
samples, one gets "closer" (in the mean squared error sense) to the 
true value, on average. Technical details of why this works are 
given in Appendix S 1 . 

Let the predictor formed from a single training set of size A^Train 

be ^^G,-^ =G,- g*(f = l,2,...,A^Train)- Its variance can be lowered 

by taking B bootstrap copies of size A^xrain (i-C, sampling with 
replacement from the training set) and then averaging over copies. 



MSEU^,,^^Jg 



eU 



bagged " 



g £ ^, 



bagged ' 



bagged 

Now, if the B bootstrap copies of GBLUP are drawn from the 
same distribution, ^bagged same expectation as g and, 

therefore, the same bias: £^^bagged ~glg) Further, if the B 

copies are viewed as drawn independendy, the variance of ^bagged 
should be B times smaller than that of any g^. Thus 



bagged 



-dS 



Var{g) 



a formula for taking the correlation between samples into account 
is not available but the reduction in MSE would be obviously 
smaller than in the idealized situation where samples are 
independent. Hence, ^bagged h'^s the same bias of GBLUP but at 
best is B times less variable. If a prediction machine has little 
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variance, as it is the case for shrinkage methods with a large 
amount of regularization, predictive performance might be 
degraded [31], but whether this occurs in a partictilar problem 
or not can be assessed empirically only. 

Bagging marker effects 

Alternatively, one can work directly with the linear model (1), 
but this is more involved computationally because the problem 
becomes a />-dimensional one (in GBLUP there are n<p 
unknowns). Assuming centered data, the ridge regression estimator 
(BLUP of marker effects) is 

;ff=(x'x+I/l)" X'y, 

where some ^ = (in the BLUP framework) is available, 

estimated from data at hand or chosen over a cross-validation 
grid. Given the data matrix D = [y,X], of order « x (1 +/>), draw B 
bootstrap copies by randomly sampling n rows from D with 
replacement, such that a particular bootstrap sample is 
Di, = [yj,Xi]. The bagged ridge regression estimator is 

b=l 

Then, given an out of sample case with marker matrix Xxest> the 
yet-to be realized phenotypes are predicted as YBagged = ^Test 

^Bagged- 

Using the relationship g = XyJ, one can obtain "indirect" 
samples of the bootstrap distribution of g as 

Xi,(^^l,Xi, + lXj X!i,yf, = Hhyi, (where 

Hj = X/, ^XjXft + 1/?,) Xj is the "hat" matrix for sample b) and 
form the "indirect" bagged GBLUP as 

1 ^ 

gBagged,indirect = "d X/ " ^ ^ 



Results 

Simulated case studies 

Case Study 1: model training with 200 known QTL and 
iVTram = 500. To evaluate the impact of bagging on model 
training, we simulated 200 QTL with known binary genotypes (as 
in inbred lines, where genotypes at a given locus are either aa or 
AA). Genotypes were sampled with a frerjuency of 0.05 at all loci 
and their effects were drawn independentiy from a N{0,j) 
distribution; sample size was Afprain = 500 so all QTL effects were 
likelihood identified (estimable) in the regression model described 
below. Any resulting disequilibrium was due to finite sample size 
only. Plienotypcs \v(;r(; formed by summing tlu; product of QTL 
genotype codes (0,1) times their corresponding effects o\'er the 200 
QTL and then adding a random residual drawn from A^(0, 10); 
the "effective" heritabiUty (variance among simulated realized 
genetic values as a fraction of the total variance) attained in the 
simulation was 0.34. The "true" model was employed in the 
training process using the "true" variance ratio 1 = 20 and the 
effect of the number of bootstrap samples {B), each of size 



'^Train = 500, on the bagged predictor was examined by taking 
B=50, 200 and 500. While 5 = 25 or 50 is often adequate [31], a 
larger number of bootstrap samples is not harmful. 

Regressions of the 200 elements of q on either their ordinary 
least-squares estimates (OLS), BLUP-ridge regression (a = 20) and 
on a bagged mean obtained by averaging over bootstrap sample 
estimates of the BLUPs of q were calculated. This was also done 
for the regression of the 500 simulated genetic signals (g = Qq) on 
either GBLUP and on the average of the GBLUP bootstrap 

samples (gBaggcd) ■ Table 1 presents results. As expected from 
BLUP theory [10] the regressions of either q or g on their 
corresponding BLUPs were near their expected value: 1. By 
construction, Var[BLUP{q)] = Cov{q,BLUP{q)), so the expected 
regression is necessarily 1. For QTL effects, OLS, even though 
being an unbiased estimator of q, produced a regression of about 

0. 42. The bagging procedure produced regressions that exceeded 

1 , both at the level of the QTL effects and of the genetic signal g. 
From the point of view of goodness of fit to the data, ridge 
regression and bagging produced models that accounted for more 
variation of the training data than OLS: for OLS was about 
0.49 whereas it ranged between 0.51 and 0.53 for bagging and 
ridge regression. Increasing the number of bootstrap copies in the 
bag increased mildly with no sizable gain resulting from 
increasing B from 200 to 500. The regressions of g on gBagged were 
larger than 1 but the bagged means accounted for the same 
proportion of variation in true g values as goBLUP did, with 
R^ = 0.63 for tiie latter and for bagged GBLUP with 5=200 or 
500. 

Figure 1 (left panel) gives the distribution of estimates of the 200 
QTL effects within each of 4 bootstrap samples for the run with 
5 = 500 as well as within the vector of bagged means. As expected 
bagging produced less variability among individual effect esti- 
mates, as "extreme" values are tempered by the averaging 
procedure. The impact of bagging is seen in the right panel of 
Figure 1 : the variance among bagged means of individual effects 
was much less than the variance within any of the 500 bootstrap 
copies. Figure 2 (left panel) shows the variance reduction when 
bagging was applied to the GBLUPs of individuals: the horizontal 
line at the bottom is the variance among the bagged GBLUPs of 
the 500 individuals in the training sample. The right panel of 
Figure 2 shows that bagged GBLUP (BGBLUP) produces an 
understatement of genetic vEilues relative to what is predicted by 
GBLUP for individuals ranked as "high" by the latter, but an 
overstatement otherwise; however, the two procedures give 
aligned predictions of genetic values. Bagging r(;gr(:sses extreme 
GBLUP estimates towards their average, which might attenuate 
influence from idiosyncratic samples on which GBLUP is trained. 
Hence, bagging enhances the shrinkage towards 0 inherent to 
GBLUP. 

Case Study 2: model training with 1000 known QTL and 

^Train = 500. As in case 1, sample size was A'jrain = 500 but 1000 
QTL with known binary genotypes and unknown effects were 
simulated. Here n<rank(Q) so least-squares cannot be used due 
to lack of estimability. The setting was as in case 1, but "effective 
heritability" was 0.68, that is, twice as when 200 QTL were 
simulated. The "true" model was employed in the training process 
using the "true" variance ratio A = 20, and the number of 
bootstrap samples for bagging was 5 = 25,50 or 100. 

Results are presented in Table 2. The regressions of q on qnagged 
were much larger than 1 and the slope seemingly stabilized at 
about 1.32 with a bag consisting of 50 bootstrap copies. As 
expected, the regression of q on qj^dge was near 1 . However, the 
proportion of variation in true q accounted for by the variation in 
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Table 1. Regression coefficients of true QTL effects (q) on their ordinary least-squares (qoLs)- fidge regression BLUP ^qRidgej and 
bagged ridge regression BLUP estimates ^qsagged)' ^i^d regressions of true genetic signal (g) on genomic BLUP (goBLUp) ^i^d 
bagged genomic BLUP (gBagged) varying number of bootstrap samples (B). 







No. Bootstrap samples^ 


B = 20 


B = 200 


B = 500 


Regressions of q on | 


QOLS 


0A2(R^=0A9) 








QRidge 


0.98 (J?^=0.53) 








^Bagged 




1.05 (;?-=o.5i) 


1.07(J?^ = 0.53) 


1.08 (i?2=0.53) 


Regressions of ^ on | 


Sgblup 


0. 99 {R- =0.63) 








SBagged 




1.05(J?- = 0.61) 


im(R^ = 0.63) 


1.08(J?2=0.63) 



R^ is the coefficient of determination of the regression fitted. The simulation involved a training set of 500 individuals with 200 true additive QTL fitted in the model. 
doi:1 0.1 371 /journal.pone.0091 693.t001 



bagged or ridge regression estimates was smaller than in case 1, 
with i?^ ~ 0.29 — 0.30 for bagging and ridge regression versus 
i«2^0.51-0.53 in case 1. On the other hand, bagged GBLUP 
accounted for about 73% of the variation of the true genetic values 
g, providing a "fit to signal" similar to that of GBLUP. The 
regression of g on goBLUP was also near its expected value of 1 , 
whereas true differences in genetic values among individuals were 
exaggerated by a factor of about 1.25-1.27 by bagging. The left 
panel in Figure 3 shows the alignment between 4 randomly chosen 
bootstrap samples and the 500 GBLUP estimates obtained in 
^Train- The right panel illustrates clearly that bagging (5=100) 



N-500, QTL=200, B=500 




N-500, QTL=200, B=500 




8 ^^'^ °„°'^o° 



100 



"T 1 — 

300 

Index 



— r 

500 



"pulls down" individuals with large GBLUP values and "pulls up" 
those placed at the left tail of the distribution of GBLUPs. 

Case Study 3: model training with 20 unknown QTLs, 200 
markers and A^xrain = 500. The setting was as in case 1 
(A'xrain = 500) but here the genetic signal was generated from 
q = 20 QTL with unknown binary genotypes that were in linkage 
disequilibrium (LD) with p = 200 binary markers as specified 
below. The true model was 

y = Qq + e, 

where q is a 20 x 1 vector of allelic substitution effects. QTL 
genotypes were equally frequent at all loci and their effects were 



N=500, QTL=200, B=500 



N»500, Q'n.=200, 8=500 



ooo 



ooo 
" o 
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300 

Index 



500 




GBLUP 



Figure 1. Simulation with 200 known QTL, 500 individuals in 
the training sample and 500 bootstrap copies for bagging. Left 
panel: distribution of 200 effects vj/ithin each of 4 bootstrap samples (1- 
4), and within the average (bag) of 500 samples (5). Right panel: 
distribution of variance among 200 effects within each of 500 bootstrap 
samples and within their average (item 501, flagged with arrow). 
doi:1 0.1 371/journal.pone.0091 693.g001 



Figure 2. Simulation with 200 known QTL, 500 individuals in 
the training sample and 500 bootstrap copies for bagging. Left 
panel: variance (VAR) among 500 GBLUPs in each of 500 bootstrap 
samples. The horizontal line gives the variance among bootstrap 
(bagged) means for the 500 individuals. Right panel: scatter plot of 
bagged GBLUP (BGBLUP) versus exact GBLUP for 500 individuals. 
doi:1 0.1 371/journal.pone.0091 693.g002 
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N=500, 1000 QTL, B=100 




GStUP 



N=500, 1000 QTL, B=100 




GSLJP 

Figure 3. Simulation with 1000 l<nown QTL, 500 individuals in the training sample and 100 bootstrap copies for bagging. Left 
panel: relationship between GBLUP with the entire sample and GBLUPs in 4 bootstrap samples of size 500. Right panel: relationship between 
bagged GBLUP (BGBLUP) and GBLUP of the 500 individuals. 
doi:10.1371/journal.pone.0091693.g003 



1, 



drawn independently from a A^(0, -) distribution. Phenotypes 

were formed by summing the product of QTL genotype codes 
(0,1) times their corresponding effects over the 20 QTL, and 
adding a random residual drawn from A' (0,10). The method 
employed for training was ridge regression (BLUP) on 200 markers 



using the "true" variance ratio 1 = 20, and the number of 
bootstrap samples for bagging was 5=100. The simulation 
generated an effective heritabUity of about 0.19. 

LD was simulated statistically by introducing correlations 
among columns of the Q (QTL genotypes) and X (marker 
genotypes) matrices, respectively. This was done by drawing A^Train 



Table 2. Regression coefficients of true QTL effects (q) on their ridge regression BLUP (^qRygej and bagged ridge regression BLUP 
estinnates ^qsagged)' ^i^^ regressions of true genetic signal (g) on genomic BLUP (goBLUp) ^i^^ bagged genomic BLUP (^gBaggcd) 
varying number of bootstrap samples (B). 







No. Bootstrap samples^ 


B = 25 


B = SO 




B=WO 


Regressions of ^ on J. 


^Ridge 


0.97(J?2=o.30) 










iBagged 




1. 30(^2 = 0.29) 




= 0.29) 


1.32(J?2 = 0.29) 


Regressions of ^ on | 


Sgblup 


0.99(j?- = 0.73) 










gBagged 




1.25 {R^=0J3) 




= 0.73) 


1.27(/J2=0.73) 



is coefficient of determination of the regression fitted. 
The simulation involved a training set of 500 individuals with 1000 true additive QTL fitted in the model. 
doi:l 0.1 371/journal.pone.0091 693.t002 
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independent Beta{a,b) random variables corresponding to tlie 
rows of these matrices and then sampling a Bernoulli random 
variable (i.e., 0 for aa and 1 to AA, say) with probability of success 
given by the draw from the Beta distribution. Thus, columns of 
matrices Q or X (with entries 0 or 1) had a beta-binomial 
distribution (e.g., Casella and George, 1992) and an expected 
correlation equal to l/(fl + i+l); employing a = 0.30 and 

6 = 0.70 a correlation equal to - would be expected. Using this 

approach the "first" 10 QTL were in LD among themselves as 
well as in LD with the "first" 100 markers; these markers were in 

mutual LD themselves with a correlation of - also. On the other 

2 

hand, QTL 1 1-20 were in mutual linkage equilibrium as well as 
with all other markers. To Ulustrate, in the sample simulated 
realized genotypes at QTL 1 and 2 had a correlation of 0.48, 
whereas QTL 1 and 15 had a correlation of —0.02. Likewise, the 
correlation between markers 1 and 2 was 0.50, that between 
markers 1 and 200 was —0.04 and QTL 2 was correlated with 
marker 105 at 0.03. QTL genotypes at each locus were multiple- 
regressed on the 200 marker genotypes: for QTL 1—10 (in LD with 
markers) the B? of the regression of QTL genotypes on the 200 
marker genotypes was about 0.70 or larger. For the 10 QTL in LE 
with markers P? fluctuated around 0.40. This last result illustrates 
that even a null association accounts for some variation, merely 
because the likelihood increases monotonicaUy with model 
complexity (in this case there are 200 partial regressions of each 
QTL genotype on markers). 

The regression of phenotypes on QTL genotypes or on markers 
had an B? at 0.24 and 0.43, respectively, with the latter being 
larger simply because more parameters are fitted in the model. On 
the other hand, the squared correlation between true signal and 
fitted values was 0.87 for the QTL model versus 0.15 for the 
marker-based model: even though markers captured more 
variation (because of higher model complexity) than a regression 
on true genotypes, their ability of capturing signal was much less. 

We measured similarity among individuals by constructing 
"genomic correlation matrices". This was done by centering both 

Q and X, calculating QcenteredQcentered ^nd XcenteredXc5.ntered> ^nd 

converting these into correlation matrices Rg and R^f, respec- 
tively. A plot of the olf-diagonal elements of Rm versus those of 
Rg is shown in Figure 4 for the corresponding pairs of individuals. 
Although there is an association between genomic correlations at 
the QTL and marker levels, the latter ones were smaller in 
absolute values. This association was not perfect: at a given level of 
correlation at the QTL level, there was much variation in 
relationships when measured by markers. Implications of this on 
accuracy of genome-enabled prediction are discussed in [7] . 

)i 3 

GBLUP and BGBLUP were calculated at each of - ,X and - X 

2 2 

(with 1 = 20) as variance ratio, to investigate the impact of 
regularization on the ability of capturing "true" signal, Qq. The 
regression of signal on GBLUP was 0.48, 0.54 and 0.60 for the 
three values of the regularization parameter, respectively, whereas 
that for BGBLUP was 0.50, 0.59 and 0.65, respectively. This 
illustrates that the regression of "true values" on GBLUP 
predictions is not 1 when the model is incorrect, as it is the case 
when markers are not QTL, as simulated here. The corresponding 
B^ values were 0.25, 0.27 and 0.28 for GBLUP, and 0.24, 0.28 
and 0.30 for BGBLUP. While 1 = 20 is the "correct" regulariza- 
tion to be exerted at the QTL level, this is not so for the model 
based on markers, where a stronger degree of regularization is 
needed (the number of markers was larger than the number of 
QTL). An approximation is that the "correct" variance ratio for 




I 1 1 1 r 

-1.0 -0.5 0.0 0.5 1.0 

QTL 



Figure 4. Simulation with 20 unknown QTL, 200 marlcers and 
500 individuals: "genomic correlations" among 500 individu- 
als at QTL or marker loci. 

doi:1 0.1 371/joumal.pone.0091 693.g004 

the marker based model should be pjnif = 10 times larger than X, 
where riq = 20 is the number of QTL. We found (results not 
shown) that the regressions of signal on GBLUP and BGBLUP 
increased as larger values of the regularization parameter were 
applied to the marker based model, with the "optimum" being 
near \0X, as expected. 

Case Study 4: predictive cross-validation with 100 
unknown QTLs and 500 markers. The simulation posed 
100 unknown QTL whose additive effects were drawn from a 
N {0,1. 25) distribution and 500 individuals genotyped for 500 
markers. The LD structure was similar to that of case 3: the first 50 
QTL were in mutual linkage disequilibrium as well as in LD with 
the first 250 markers. QTL 51-100 were in LE among themselves 
as well with all markers; the first 250 markers were in mutual LD 
and markers 251-500 were in LE. Residuals were drawn from 
A^(0,10) and the "eflective" heritability attained was about 0.74. 

The 500 individuals were distributed at random into two non- 
overlapping training and testing sets with 250 members in each. 
This was done 100 times at random, producing 100 training- 
testing pairs enabling estimation of the cross-validation distribu- 
tion. GBLUP and BGBLUP were fitted to the training data 
(A^Xrain = 250) for cach of 13 values of the regularization 
parameter X in the sequence 

^scq=(^j^,^,^,^,l,2,4,8,16,32,64,128,256j xX,,j,,,„, (16) 

where /.//xme" = ^ with ff^ x (T^ — . In each training instance, 
ff(5 P 

BGBLUP was implemented with B = 25 bootstrap samples of size 

Axrain drawn from the training set of size 250. Three metrics were 

used to evaluate the two prediction methods: goodness of fit in the 

training set (correlation between fitted and observed values), 

predictive correlation (predicted phenotypes in the testing set and 

realized values) and predictive mean-squared error, that is, 

average squared difierence between predicted and reahzed values 

over the Ajest = 250 cases in the testing set. 
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The preceding involved 13 BGBLUP and GBLUP implemen- 
tations in each of the 100 random cross-validations, for a total of 
1300 comparisons. Overall (Figure 5), BGBLUP attained a better 
predictive performance than GBLUP because predictive correla- 
tions were typically larger (in some cases more than twice as large 
as GBLUP) and mean squared errors of prediction were lower as 
well. Thus, BGBLUP was more reliable (larger correlation) and 
more accurate (smaller mean squared error) than GBLUP. The 
superiority of BGBLUP over GBLUP became smaller when 
regularization was stronger over the grid defined by Iseq- 
However, it was not until logjQ Iseq became 5 or more times 
larger than (logio '^'Tme") that GBLUP "caught up" with bagging 
but was never better. Actually, it was not until the X value used for 
model training was 32 times larger than Ijmc that the two 
predictors delivered the same performance, suggesting a "robust- 
ness" property of BGBLUP that GBLUP seems to lack. Model 
complexity is reduced as X increases (the effective number of 
parameters decreases) so the variance of GBLUP decreases as well, 
in which case bagging offers litde help. Contrary to what was 
stated by [7] we did not find evidence that bagging damaged 
predictive performance under any of the regularization regimes 
entertained. Results for selected settings are discussed in the 
following paragraph. 

Figure 6 shows training and predictive correlations and mean 
squared errors for BGBLUP (y-axis) and GBLUP (x-axis) at 2 
levels of "under-regularization": /l = /lTrue/16 and A = ATrue/2. 
Here, where shrinkage is less than it should be, given the model, 
BGBLUP reduced overfitting (smaller training set correlations), 
increased predictive correlations and reduced mean squared 
errors, relative to BLUP. Differences were marked: in the upper 
(lower) middle panels of Figure 6 it is seen that the largest 
predictive correlation attained by GBLUP was smaller than 0.30 
(0.39) and that bagging increased the corresponding correlation to 
about 0.38 (0.41). The effect of bagging on reducing predictive 
mean squared was also clear. 

Figure 7 presents results when the "true" value of X (400) was 
used for training. GBLUP was again close to overfitting and 
BGBLUP reduced training correlations, thus tempering the 
problem. Predictively, BGBLUP was better than GBLUP in all 
100 comparisons, both in the correlation and MSE senses. Over 
the 100 cross-validation runs, the predictive correlation ranged 
between 0.195 and 0.391 (median 0.282) for GBLUP and between 
0.243 and 0.423 (median 0.330) for BGBLUP. MSE of prediction 
ranged between 33.64 and 46.66 (median 38.47) for GBLUP and 
between 30.24 and 43.41 (median 3,5.01) for BGBLUP. Compar- 
ing the medians of the distributions, BGBLUP enhanced the 
predictive correlation by 1 7 % and MSE was 9 1 % of that of 
GBLUP. Figure 8 depicts what was found when "excessive" 
shrinkage was applied in training: the regularization parameter 
values were 8/tTruc (upper panel) and 16/?,Tme (lower panel). 
BGBLUP was only marginally better than GBLUP. Strong 
shrinkage rendered the training model exceedingly simple so both 
methods delivered similar same predictive ability. Differences 
between methods vanished when 256ATrue was used as shrinkage 
parameter. 

We repeated the experiment with the setting of case study 4 but 
increasing training and testing sample sizes to 2.500 each. Here, 
training sample size was 5 times larger than the number of 
markers: no differences between BGBLUP and GBLUP were 
found at any level of regularization. The reason is that n now 
exceeds p, making GBLUP fairly stable, in which case the variance 
reduction property of BGBLUP does not help much. 

In summary, in our simulations BGBLUP was typically better 
than GBLUP at most points of the regularization grid considered. 



Its performance was better than that of GBLUP at values close to 
"optimal" regularization, and differences were large when 
shrinkage was small, because bootstrap sampling with averaging 
reduced variance. If A is small, GBLUP tends to overfit and to be 
\'ariable but BGBLUP alleviates these problems. In addition to 
helping with overfitting, BGBLUP was robust with respect to 
departures from optimal regularization, e.g., to errors in the 
variance ratio. The experiment with a much larger training sample 
size than the number of markers indicated that the performance of 
bagging depends on /^/A^xrain^ we conjecture that as this ratio 
increases (as it will surely be the case with DNA sequence data) use 
of BGBLUP may enhance predictive ability in some real data 
situations, simply because overfitting and colinearity will be 
exacerbated by introducing a massive number of variates in the 
model. 

Analysis of wheat data 

The data set is at http://cran.r-project.org/web/packages/ 
BLR/index.html, in the BLR package in R, and has been used by, 
e.g., [17], [42-43]. The data represents 599 wheat inbred lines 
each genotyped with 1279 DArT (Diversity Array Technology) 
markers, and planted in 4 distinct environments. DArT markers 
may take on one of two values, denoting presence or absence of an 
allele. Records came from several international trials conducted at 
the International Maize and Wheat Improvement Center 
(CIMMYT), Mexico. The trait considered was average grain 
yield for each line in each of the four environments. This response 
variable is an a\'erage o\'er a balanced set of plots and replicates 
and, within environment, the residual variance is expected to be 
constant over lines. 

To provide "proof of concept", we used a ridge-regression 
BLUP model with 600 randomly chosen markers. All markers 
were not employed in order to facilitate calculations, since matrix 
inversion was used for every bootstrap sample and cross- 
validation. With a large number of markers or individuals or for 
routine application, GBLUP can be computed in using iterative 
algorithms, but this is a numerical, as opposed to conceptual, issue. 
Further, iVxrain = 449 and iVxcst = 599 - 449 = 1 50; the model was 
trained using a grid with 10 values of 1: 5, 10, 50, 100, 150, 200, 
250, 300, 350 and 400. All lines were present in each partition of 
the data into the two disjoint training and testing sets, with the 
process repeated at random 100 times, to estimate the cross- 
validation distribution. Each implementation of BGBLUP used 25 
bootstrap samples and performance was evaluated via predictive 
correlation, predictive mean squared error and mean absolute 
deviation between predicted and realized phenotypes. We found 
(results not shown) that the "optimum" X was around 50-100 in 
environments 1 and 2 with stronger regularization needed in 
environments 3 and 4. BGBLUP was slightly better than GBLUP 
in environment 1 near optimum regularization, and clearly better 
at the lower values of X because GBLUP nearly overfitted; a 
similar picture emerged in environment 2. 0\ erall, BGBLUP was 
better than GBLUP when X was below the optimum, sometimes 
slightiy better at the optimum, with no difference if regularization 
was excessive, due to the fact that the two models were rendered 
effectively simpler as X grew. 

The upper and lower panels of Figure 9 give results for 
environments 3 and 4, respectively; results for environments 1 and 
2 were similar. As found with the simulated data, over the 100 
repetitions x 10 values of A = 1000 comparisons, BGBLUP tended 
to produce larger predictive correlations and smaller mean 
squared errors than GBLUP. However, this superiority was not 
uniform and depended on whether or not the model was "under" 
or "over" regularized, with BGBLUP having slightiy better 
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Figure 5. Simulation with 100 unknown QTL, 500 markers and 250 individuals in each training and testing set: training 
correlations, predictive correlations and predictive mean-squared error (MSE) for 1300 comparisons between bagged GBLUP 
(BGBLUP, 25 bootstrap samples) and GBLUP. 

doi:10.1371/journal.pone.0091693.g005 



performance in the former situation but slightly worse in the latter. 
This is illustrated for environment 4 in the upper left panel of 
Figure 10, where average differences between BGBLUP and 
GBLUP over the grid of lambda values are shown for predictive 
correlations, predictive mean square error and average absolute 
difference between predicted and realized values. The other panels 
of Figure 10 indicate that, over the 100 cross-validations, 
BGBLUP was better at low values of X, slightly better at near 
optimum values of X and mildly worse when regularization was 
extreme (/. = 400). Hence, it would seem that BGBLUP performs 
at least as well as GBLUP unless a gross error is made in assessing 
the value of the regularization parameter in model training. Such a 
large error is unlikely if the variances are estimated from training 
data (unless the sample is small) or evaluated over a grid of suitable 
candidate values. One should be cautious about ehcitations of the 
regularization parameter based on simple theoretical arguments 
that may not hold. 

Discussion 

We examined whether or not bootstrap sampling in the context 
of GBLUP can enhance predictive ability in cross-validation. 



Simulation (with known or unknown QTL) and a wheat data set 
with grain yield information were used for this purpose. In the 
simulations, it was found that bagging BLUP estimates of marker 
effects or of genomic signal increased the slope of the regression of 
true marker or marked breeding value on predictor relative to 
what is expected under BLUP theory. When an individual was 
evaluated as "extreme" by GBLUP, bagging made the estimate 
less extreme. If the linear model entertained holds, the regression 
of true signal on GBLUP is expected to be 1 , but the regression on 
BGBLUP is steeper because the latter has smaller variance. This is 
easy to see: if 7" is a predictand and T is its BLUP, then 
Cov(T,T)= Var(T) so the slope of the regression of T on T is 
one. Now, if T^Bagged is the average of B bootstrap copies of T, the 
variance of iBagged is about B times (assuming samples are mildly 
correlated) smaller than that that of T, but Cov{T,T) = Cov 
(7">7Bagged)- Hencc, the regression must be larger than 1. 

It was also found that bagging conferred robustness to GBLUP 
because it is less prone to over-fitting and often delivered better 
predictions in terms of correlation and mean squared error even 
when regularization was "optimal". At least in simulated data. 
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Figure 6. Simulation with 100 unknown QTL, 500 markers and 250 individuals in each training and testing set: training 
correlations, predictive correlations and predictive mean-squared error (MSE) for 100 comparisons between bagged GBLUP 
(BGBLUP, 25 bootstrap samples) and GBLUP at two levels of "under-regularization". 

doi:1 0.1 371 /journal.pone.0091 693.g006 



BGBLUP was not inferior to GBLUP when shrinkage was beyond 
what it should be. 

Bagging allows estimating a cross-validation prediction error 
mean squared error for each subject tested. In theory, given 
training data, GBLUP in a testing set can be computed as 



SXrain " Gxrain Gxrain + iNtrain — V 
y = (iNtrain + Gj^l^ijj ^Optimum) y> 

with ^Optimum being some "optimum" value of the regularization 



parameter. If the problem is that of predicting a future set of 
records Jj^-st of the same individuals, the variance-covariance 
matrix of prediction errors (under normality assumptions) is 

^^«''(yTestlyTrain) = ^^aKglyTram) +1^?- 

Given that the assumptions hold, there are two sources of 
uncertainty here. The first is uncertainty about signal (breeding 
value) given training data, and the second is noise variability 
associated with the yet to be reahzed observations. The model 
derived (expected) reliabilities of the predicted genomic value 
values from the training data are given by the diagonals of matrix 
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Figure 7. Simulation with 100 unknown QTL, 500 markers and 250 individuals in eacii training and testing set: training 
correlations, predictive correlations and predictive mean-squared error (MSE) for 100 comparisons between bagged GBLUP 
(BGBLUP, 25 bootstrap samples) and GBLUP at the "correct" level of regularization. 

doi:10.1371/journal.pone.0091693.g007 



i^G — I ~ (iNtrain + G-j-j,',i„/loptimum) 
din ji / ' 

where ^jmin is the marker matrix in the training set. Note that (17) 
does not make reference at all to realized outcomes (testing set 
phenotypes) or to realized predictions, so using the term 
"accuracy" in lieu of "reliability" is misleading. While the 
diagonal elements of £la may be close to 1, this does not give 
assurance tiiat predictions wiU be any good. The reliability matrix 
uses only information on marker genotypes (via the X matrix) and 
variance components, but do not exploit information on pheno- 
types. Importantly, the two measures ignore model uncertainty, 
thus exaggerating prediction reliability relative to what would be 
observed in a cross-validation distribution. Model goodness of fit 
statistics in training data lead to expectations that seldom translate 



into what is observed in cross-validation (e.g., [44]) and examples 
of this are in a study of human height with molecular markers by 
[7] and [45]. The problem of developing credible individual- 
specific measures of reliability in cross-validation has not been 
solved yet [30] but a practical solution can be arrived at by use of 
bagging. 

Let the frxed, observed, outcome (e.g., the mean of an inbred 
line of wheat, a daughter yield deviation of an artificial 
insemination buU or the phenotype of a subject) in a testing set 
be yi, i= l,2,...,A'Test, and let the prediction from GBLUP be (j),, 
so the realized prediction error is j), — 0,. In BLUP theory, 
predictor and predictand (the latter with eventual realized value yi) 
vary at random over conceptual repeated sampling, given 
some linear model, but here j,- is an observed realization from 
an unknown process. Using bagging, B bootstrap samples of the 
distribution of (j), are available, so one can form the set of 
prediction errors £, = ■{ — ^'''.j', — — 0,-^' l for 
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Figure 8. Simulation with 100 unknown QTL, 500 markers and 250 individuals in each training and testing set: training 
correlations, predictive correlations and predictive mean-squared error (MSE) for 100 comparisons between bagged GBLUP 
(BGBLUP, 25 bootstrap samples) and GBLUP at two levels of "over-regularization". 

doi:10.1371/journal.pone.0091693.g008 



i=l,2,...,N-Ysst- For each /, the bootstrap average squared 
prediction error associated with GBLUP (given j,-, Xxrain and 
Xjest) is assessed as 



BPEi -- 



B 



\i=l,...,Nj 



(18) 



noting that this squared cross-validation prediction error reflects 
both squared bias (unknown) and variance. Similarly, a cross- 
validation reliability measure can be constructed as 



BPE 

BPRELi = \ ^;/=l,...,7VTest, (19) 

^Test 

where 



VTest = 



E [y.-U 

1 = 1 



1 



BPREL takes values between 0 and 1 provided that BPE, < Vxest, 
which cannot be assured unless one replaces Vjest by, say, 
max. ,. iBPE,). A disadvantage of BPREL, is that it does 

/^Testing set V ■' / ^ 

not take into account the fact that, given X, all observations are 
expected to have a difTerent phenotypic variance, depending on 
how a genomic relationship matrix is constructed in GBLUP. 

Recall that GBLUP poses g\al~N ^0,Gf7^^ , leading to the testing 

set expected variance-covariance structure 



Vjest — Gxeslfg + Ixestfj, 
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Figure 9. Wheat data in environments 3 and 4: predictive correlations and mean-squared errors in 1000 cross-validations (100 
random partitions intro training-testing sets and 10 levels of the regularization parameter). 

doi:10.1371/journal.pone.0091693.g009 



In the absence of some scaling (with the latter having 
consequences on the definition of ct^) the diagonal elements of G 
vary over individuals, so the diagonals of Vjest vary as well; this 
does not occur in a pedigree-based model if all individuals have the 
same level of inbreeding. One way of taking this into account is to 
modify the "reliability" measure (19) into 



prediction error variances derived in the training process and 
bootstrap average squared prediction errors, which make use of 
both training phenotypes, via <^^P and realized values y,. Likewise, 
as shown in Figure 12 the empirical BPREL (top panel) and the 
adjusted reliabilities (bottom panel) are unrelated to model derived 
reliabilities. The adjusted rehabUities were calculated as 



BPREL, = 1 



BPEi 



<iwg,(VTest) 



(20) 



where diagjiy-Yzst) is the ith diagonal element of Vxest- 

We examined this proposal under the setting of case 4, with 100 
unknown QTL, 500 binary markers, -/Vxi-ain = A'xest = 250. In the 
simulation (results not shown), we found in cross-validation that 
the "optimum" X in terms of predictive correlation and mean- 
squared error sense was 3200. We trained the model using 
i= 1600, 3200 and 6400 and 5 = 25,50 and 100. Differences in 
BPEj obtained with the three values of B were very small and the 
three levels of regularization produced the same qualitative 
picture, with prediction mean-squared error increasing with 
stronger shrinkage. Figure 1 1 illustrates the disconnect between 



BPREL: = 1 



BPEi 

max [ Kar(>'Test),maXfeTesting set {BPEj) 



(21) 



with Var{yjgst) constant over the training set. When using 
BPREL median reliability (using 5 = 25) was estimated at 0.702, 
0.725 and 0.722 at the three level of regularization but a few ones 
were negative. After the adjustment in (21) all reliabilities varied 
mostly between about 0.40 and slightly less than 1 and these values 
were unrelated to theoretical reliabilities. Predictions were quite 
accurate, in general (recall that the same stochastic process was 
used to create training and training sets). 

There does not seem to be a theoretical reason leading to expect 
an agreement between model based reliabilities and measures of 
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Figure 10. Wheat data in environment 4. Upper left panel: average differences (over 100 cross-validations) between bagged GBLUP and GBLUP 
for predictive correlations (PCOR), mean-squared error (PMSE) and absolute value of differences (PABS) between prediction and realization at 10 
values of the regularization parameter. Upper right, lower left and lower right panels give the three metrics for lambda values of 5, 100 and 400, 
respectively. 

doi:1 0.1 371/journal.pone.0091 693.g01 0 



Test 



cross-validation performance, because the latter gauge difTerent 
things. The theoretical reliabihties, based on a model deduced 
quantity, are just indicators of the amount of information in the yxest ~ 

training data set without making reference to the "goodness" of 
any prediction. On the other hand, BPREL or variants thereof 
take into account "closeness" between prediction and realized 
value, with bagging enhancing the stability of the prediction. 
Hence, we argue that bagging is sensible because it reduces the gy. 
variance of GBLUP, seemingly without hampering predictive 
ability, and provides a means for ascertaining the (conditional) 
prediction bias in a strict sense. If BPEi is close to 0 the squared 
prediction error is small, so that the prediction has a small 
variance, a small bias, or both. Irrespective of the cause, the cross- 

vaUdation measure of rehabUity would be close to 1 . = 1^ 

To discuss influences that theoretical reliability may have on 
predicted values in a testing set (yxest)) 'lote that, when using ridge ^j^^^ 
regression BLUP, 



-^Train-^Train ' 



„ A 



X 



Train y Train 



The influence training data have on predictions via GBLUP can 
be measured by the derivative or "hat matrix" 



Test 



^Train^Train ' 



Ntrain 



-1 



^'yTrain 

Now, the matrix of "reliabilities of marker effects" is 



^(X^ra 



,x 



Train 



-1 
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Figure 11. Disconnect between expected prediction error 
variances (theoretical PEV) and empirical bootstrap average 
squared prediction errors (empirical PEV) for a simulation 
under the settings of case study 4. True lambda = 20. 
doi:1 0.1 371 /journal.pone.0091 693.g01 1 



1 



Hence, the predicted values can be seen to be less sensitive with 
respect to variation in training data when reliabilities (in this case 
of marker efiects, but the same logic carries for GBLUP) increase 
and when A gets larger; when i -» oo the model becomes essentially 
null as the effective number of parameters goes to 0. Informally, 
has an upper bound at Ip so, when reliabilities are perfect, 
predictions are insensitive with respect to variation in training 
data. However, even in this perfect case and assuming the model is 
correct, there is no clear connection between reliability and 
predictive outcome. 

Bagging did reduce the variability of GBLUP predictions and, 
as observed in our case studies, it enhanced predictive perfor- 
mance when the model was "under-regularized". When, regular- 
lzation was near optimum, bagging did not improve predictive 
performance, but it provided a means for assessing predictive 
mean squared error for any individual or candidate item in a 
testing set. This is because bagging can emulate variation in 
training data sets of a given size, allowing calculation of 
conditional (given Xxrain, Xjest and yjest) mean squared errors 
and of a measure of "reliability" connecting directly to predictive 
outcomes. These measures reflect variation in the predictor 
(rendered small by bagging), prediction bias and, of course, noise 
inherent to the fact that prediction can never be perfect. We did 
not find that bagging deteriorated predictive performance in any 
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Figure 12. Relationship between expected reliabilities and empirical reliabilities (see text) in the top panel, and empirical adjusted 
reliabilities (see text) in the bottom panel. 
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of the settings simulated, with only a slight hint in the wheat data 
set when regularization was excessive. As anticipated by [31] 
bagging helped when the predictor was more variable, due to 
small shrinkage. Coupled with the finding that predictive 
performance was not degraded otherwise, it seems that bagging 
confers robustness to the GBLUP prediction machine. 

Conclusions 

In short, bagging ameliorated the predictive performance of 
GBLUP, providing a means for developing candidate-specific 
measures of cross-validation reliability. It is computationally 

intensive when one searches for an optimum value of A because 
of the simultaneous bootstrapping. In our study it seemed that 25- 
50 bootstrap samples were enough to attain reasonable predictions 
as well as stable measures of individual predictive mean squared 
errors. In practice, X can be assessed by estimating the variance 
components in some data set and this may need to be done only 
once; the optimum I in cross-validation is often close to what one 
obtains from estimating X in the training set (de los Campos, 
personal communication), but regularization depends on the pjn 
ratio, so studies from other studies with different sample sizes (even 
from the same population) may not provide a good guide to attain 
optimum regularization in a given problem. 
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